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Abstract. Many inverse and parameter estimation problems can be written 
as PDE-constrained optimization problems. The goal is to infer the parameters, 
typically coefficients of the PDE, from partial measurements of the solutions of the 
PDE for several right-hand-sides. Such PDE-constrained problems can be solved 
by finding a stationary point of the Lagrangian, which entails simultaneously 
updating the parameters and the (adjoint) state variables. For large-scale 
problems, such an all-at-once approach is not feasible as it requires storing all 
the state variables. In this case one usually resorts to a reduced approach where 
the constraints are explicitly eliminated (at each iteration) by solving the PDEs. 
These two approaches, and variations thereof, are the main workhorses for solving 
PDE-constrained optimization problems arising from inverse problems. In this 
paper, w r e present an alternative method that aims to combine the advantages of 
both approaches. Our method is based on a quadratic penalty formulation of the 
constrained optimization problem. By eliminating the state variable, we develop 
an efficient algorithm that has roughly the same computational complexity as the 
conventional reduced approach while exploiting a larger search space. Numerical 
results show that this method indeed reduces some of the non-linearity of the 
problem and is less sensitive to the initial iterate. 
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In inverse problems, the goal is to infer physical parameters (e.g., density, soundspeed 
or conductivity) from indirect observations. When the underlying model is described 
by a partial differential equation (PDE) (e.g., the wave-equation or Maxwell's 
equations), the observed data are typically partial measurements of the solutions of 
the PDE for multiple right-hand-sides. The parameters typically appear as coefficients 
in the PDE. These problems arise in many applications such as geophysics [TJ [3 S E], 
medical imaging \5\ [B] and non-destructive testing. 

For linear PDEs, the inverse problem can be formulated (after discretization) as 
a constrained optimization problem of the form 

min h\\Pu - d||? s.t. A(m)u = q, (1) 

m.u “ 

where m £ R A/ represents the (gridded) parameter of interest, yi(m) £ C ;V x A and 
ci £ C A represent the discretized PDE and source term, u £ C‘ v is the state variable 
and d £ C L are the observed data. The measurement process is modelled by sampling 
the state with P £ R AxA . Throughout the paper r denotes the (complex-conjugate) 
transpose. 

Typically, measurements are made from multiple, say K, independent 
experiments, in which case u £ C A ‘ V is a block vector containing the state variables 
for all the experiments. Likewise, q £ C AA and d £ C AL are block vectors containing 
the right-hand-sides and observations for all experiments. The matrices j4(m) and P 
will be block-diagonal matrices in this case. Typical sizes for M, N, K , L for seismic 
inverse problems are listed in table [T] 

In practice, one usually includes a regularization term in the formulation (jlj> to 
mitigate the ill-posedness of the problem. To simplify the discussion, however, we 
ignore such terms with the understanding that appropriate regularization terms can 
be added when required. 

1.1. All-at.-once and reduced methods 

In applications arising from inverse problems, the constrained problem (jT|) is typically 
solved using the method of Lagrange multipliers [3 [§] or sequential quadratic 
programming (SQP) [Ql[T()]. This entails optimizing over both the parameters, states 
and the Lagrange multipliers (or adjoint-state variables) simultaneously. While such 
all-at-once approaches are often very attractive from an optimization point-of-view, 
they are typically not feasible for large-scale problems since we cannot afford to 
store the state variables for all K experiments simultaneously. Instead, the so-called 
reduced approach is based on a (block) elimination of the constraints to formulate an 
unconstrained optimization problem over the parameters: 

min |||P/l(m) _1 q - d|||. (2) 

rn 

While this eliminates the need to store the full state variables for all K experiments, 
evaluation of the objective and its gradient requires PDE-solves. Moreover, by 
eliminating the constraints we have dramatically reduced the search-space, thus 
arguably making it more difficult to find an appropriate minimizer. Note also that 
the dependency of the objective on m is now through 24(m) _1 q in stead of through 
A(m)u. For linear PDEs, the latter can often be made to depend linearly on m while 
the dependency of -4(m)~'q on m is typically non-linear. 


A penalty method for inverse problems 
1.2. Motivation 


3 


The main motivation for this work is the observation that the stated inverse problem 
would be much easier to solve if we had a complete measurement of the state (i.e., P 
is invertible). In this case, we could reconstruct the state from the data as u = P -1 d 
and subsequently recover the parameter by solving 

min i||j4(m)u - q|||, (3) 

rn ~ 

which for many linear PDEs would lead to a linear least-squares problem. This 
approach is known in the literature as the equation-error approach |111 [12]. When 
we do not have complete measurements, this method does not apply directly since we 
cannot invert P. We can, however, aim to recover the state by solving the following 
(inconsistent) overdetermined system 



which combines the physics and the data. We can subsequently use the obtained 
estimate of the state to estimate m by solving Q. These steps can be repeated in an 
alternating fashion as needed. In a previous paper [T3], we proposed this methodology 
for seismic inversion and coined it Wavefield Reconstruction Inversion (WRI). We 
showed - via numerical experiments - that this approach can mitigate some of the 
(notorious) non-linearity of the seismic inverse problem. In the current paper we seek 
to analyse this approach in more detail and broaden the scope of its application to 
inverse problems involving PDEs. 

1.3. Contributions and outline 

We give the above sketched method a sound theoretical basis by showing that it can 
be derived from a penalty formulation of the constrained problem: 

min i||P U - d||l + ^||.4(m)u - q|||, (5) 

the solution of which (theoretically) satisfies the optimality conditions of the 
constrained problem (JTJ as A —» oo. Such reformulations of the constrained problem 
are well-known but are, to oiu' best knowledge, not being applied to inverse problems. 

The main contribution of this paper is the development of an efficient algorithm 
based on the penalty formulation for inverse problems and the insight that we can 
approximate the solution of the constrained problem (JTJ) up to arbitrary finite precision 
with a finite A. In (ill-posed) inverse problems, we often do not require very high 
precision because we can only hope to resolve certain components of m anyway. We 
can understand this by realizing that not all constraints are equally important; those 
that constrain the null-space components of m need not be enforced as strictly. Thus, 
the parameter A need only be large enough to enforce the dominant constraints. The 
numerical experiments suggest that a single fixed value of A is typically sufficient. 

Our approach is based on the elimination of the state variable, u. from <j5j). This 
reduces the dimensionality of the optimization problem in a similar fashion as the 
reduced approach j2j) does for the constrained formulation (JTJ) by solving K systems 
of equations. This elimination leads to a cost function <j)\(n 1 ) whose gradient and 
Hessian can be readily computed. The main difference is that the state u in this case 
is not defined by solving the PDE, but instead is solved from an overdetermined system 
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that involves both the PDE and the data (|4j>. Due to the special block-structure of 
the problems under consideration, this elimination can be done efficiently, leading to 
a tractable algorithm. Contrary to the conventional reduced approach, the resulting 
algorithm does not enforce the constraints at each iteration and arguably leads to a 
less non-linear problem in m. It is outside the scope of the current paper to give a 
rigorous proof of this statement, but we present some numerical evidence to support 
this conjecture. 

The outline of the paper is as follows. First, we give a brief overview of the 
constrained and penalty formulations in sections [2]and[3] The main theoretical results 
are presented in section [4] while a detailed description of the proposed algorithm is 
given in section [5] Here, we also compare the penalty approach to both the all-at-once 
and the reduced approaches in terms of algorithmic complexity. Numerical examples 
on a ID DC-resistivity and 2D ultrasound and seismic tomography problems are given 
in section [O] Possible extensions and open problems are discussed in section [7] and 
section [8] gives the conclusions. 


2. All-at-once and reduced methods 


A popular approach to solving constrained problems of the form Q is based on the 
corresponding Lagrangian: 


£(m, u. v) = I||Pu - d||l + v r (A(m)u - q), 


( 6 ) 


where v € C KN is the Lagrange multiplier or adjoint-state variable [13117]. A necessary 
condition for a solution (m*,u*,v*) of the constrained problem (JIJ is that it is a 
stationary point of the Lagrangian, i.e. V£(m*, u". v“) = 0. The gradient and Hessian 
of the Lagrangian are given by 

C n 


and 


where 


V£(m, u, v) = 


V 2 £(m, u,v) = 


A 

R{ m, u, v) 
A'(m.v) 
G(m, u) 


G(m, u) r v, 

A( m) r v + P T (Pu - d), 
A(m)u — q, 

A(m,v) T G(m,u)' 


p T p 

yl(m) 


A(m) 

0 


(7) 


( 8 ) 


g(m,u) = 8 ' 4 < m|U , A’(m,v) = 


R( m, u, v) = 


dm 
dG(m, u) T v 
dm 


9A( m) T v 
dm 


These Jacobian matrices are typically sparse when A is sparse and can be computed 
analytically. 


2.1. All-at-once approach 

So-called all-at-once approaches find such a stationary point by applying a Newton¬ 
like method to the Lagrangian [71 [T5]* A basic algorithm for finding a stationary point 
of the Lagrangian up to a given tolerance e is given in Algorithm [T] If the algorithm 
converges it returns iterates (m*,u*,v*) such that || V£(m*,u*, v *)||2 < c. Many 
variants of Algorithm [l] exist and may include preconditioning, inexact solves of the 
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Algorithm 1 Basic Newton algorithm for finding a stationary point of the Lagrangian 
via the all-at-once method_ 

Require: initial guess m°,u°.v 0 , tolerance e 
A: = 0 

while ||V£(m fc ,u A ’, v fc )|| 2 >edo 

/ Sm k \ 

Su k = - (V 2 £(m fc , u k , v fc )) _1 V£(m fc , u k \ v k ) 

\ Svk ) 

determine steplength a k (0,1] 
m A+1 = m A + a k Sm k 
u A ' +1 = u A + a k 8u k 
v A+1 = v A: + a k 6v k 
k = k +1 
end while 


KKT system (V 2 £) 1 V£ and a linesearch to ensure global convergence |151 [151 18]. 
For an extensive overview we refer to m 

An advantage of such an all-at-once approach is that it eliminates the need to 
solve the PDEs explicitly; the constraints are only (approximately) satisfied upon 
convergence. However, such an approach is not feasible for the applications we have in 
mind because it involves simultaneously updating (and hence storing) all the variables. 

2.2. Reduced approach 

In inverse problems one usually considers a reduced formulation that is obtained by 
eliminating the constraints from (JTJ). This results in an unconstrained optimization 
problem: 

nun {0(m) = ±||Pu rc , d (m) - d|||} . (9) 

where u re d(m) = A(m) _1 q. The resulting optimization problem has a much smaller 
dimension and can be solved using black-box non-linear optimization methods. In 
contrast to the all-at-once method, the constraints are satisfied at each iteration. 

The gradient and the Hessian of 0 are given by 

V0(m) = G(m, u red ) r v r ed, (10) 

V 2 0(m) = G(m, u re d) r -4(m) _/ P r P.4(m) -1 G(m, u re d) 

- A'(m. v rcd ) T 4(m) -1 G(m,u red ) 

- G(m, u red ) 7 .4(m)~ r A'(m. v red ) 

+ P(m.u red ,v red ), (11) 

where v red = .4(m)“ 7 P (d - Pu rcd ). 

A basic (Gauss-Newton) algorithm for minimizing 0(m) is given in Algorithm 
[2] Note that this corresponds to a block-elimination of the KKT system and the 
iterates automatically satisfy £ u (m A \ u^ d , vj" ed ) = £ v (m A , uf ed , v A ed ) = 0. If the 
algorithm terminates successfully, the final iterates (m*. u* ed , v* ed ) additionally satisfy 
||£m(m*,u* cd ,v r * ed )|| 2 < e, so that ||VG(m*,u* ed , v r * ed )|| 2 < e. 
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Algorithm 2 Basic Gauss-Newton algorithm for find a stationary point of the 
Lagrangian via the reduced method 
Require: initial guess m°, tolerance e 
k = 0 

<,1 = A( m^-'q 
v r ° ed = A(m°)- T P(d - Pul,) 
while ||£ m (m fc ,u* d> v£ d )|| 2 > « do 
gf ed = G(m^u^ ed )^vf ed 

Hi, = G(m fc , Uj ed ) T i4(m fc )- T P T Pyl(m fc ) _1 G(m fc , u|f ed ) 

determine steplength a k (0,1] 

uJS 1 = i4(m fe + 1 )- 1 q 

v^‘ = A(m^)~ T P(d - PuH 1 ) 

k = k + 1 

end while 


A disadvantage of this approach is that it requires the solution of the PDEs at each 
update, making it computationally expensive. It also strictly enforces the constraint 
at each iteration, possibly leading to a very non-linear problem in m. 


3. Penalty and augmented Lagrangian methods 


It is impossible to do justice to the wealth of research that has been done on penalty 
and augmented Lagrangian methods in one section. Instead, we give a brief overview 
high-lighting the main characteristics of a few basic approaches and their limitations 
when applied to inverse problems. 

A constrained optimization problem of the form Q can be recast as an 
unconstrained problem by introducing a non-negative penalty function n : C jV —> R>o 
and a penalty parameter A > 0 as follows 

min h\\Pu — dll? + A7r(A(m)u - q). (12) 

m.u - 


The idea is that any departure from the constraint is penalized so that the solution of 
this unconstrained problem will coincide with that of the constrained problem when 
A is large enough. 

A quadratic penalty function 7r(-) = ||| • ||| leads to a differentiable unconstrained 
optimization problem (12) whose minimize!* coincides with the solution of the 
constrained optimization problem (JTJ) when A —» oo [l4l Tlim. 17.1]. Practical 
algorithms rely on repeatedly solving the unconstrained problem for increasing values 
of A. 


A common concern with this approach is that the Hessian may become 
increasingly ill-conditioned as A —> oc when there are fewer constraints than variables. 
For PDE-constrained optimization in inverse problems, there are enough constraints 
(A(m) is invertible) to prevent this. We discuss this limiting case in more detail in 
section [3 
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For certain non-smooth penalty functions, such as 7r(-) = || • ||i, the minimizer of 
© is a solution of the constrained problem for any A > A* for some A* [Til Thm. 
17.3]. In practice, a continuation strategy is used to find a suitable value for A. An 
advantage of this approach is that A does not become arbitrarily large, thus avoiding 
the ill-conditioning problems mentioned above. A disadvantage is that the resulting 
unconstrained problem is no longer differentiable. With large-scale applications in 
mind, we therefore do not consider exact penalty methods any further in this paper. 

Another approach that avoids having to increase A to infinity is the augmented 
Lagrangian approach (cf. [H]). In this approach, a quadratic penalty A||A(m)u- q||| 
is added to the Lagrangian (JfiJ). A standard approach to solve the constrained problem 
based on the augmented Lagrangian is the Alternating direction method of multipliers 
(ADMM). In its most basic form it relies on minimizing the augmented Lagrangian 
w.r.t. (m, u) and subsequently updating the multiplier v and the penalty parameter 
A [151 [15] . This would require us to store the multipliers, which is not feasible for the 
problems we have in mind. 

In the next two sections, we discuss a computationally efficient algorithm for 
solving the constrained optimization problem <JTJ> based on a quadratic penalty 
formulation. This formulation is attractive because it leads to a differentiable, 
unconstrained, optimization problem. Moreover, the optimization in u has a closed- 
form solution which can be computed efficiently, making it an ideal candidate for the 
type of problems we have in mind. 


4. A reduced penalty method 

Using a quadratic penalty function, the constrained problem ([!]) is reformulated as 

m'u { P ( m ' U ) = 5ll Pu - d lli + f A P( m ) U -q|l2}- (13) 

The gradient and Hessian of P are given by 

AG(m,u) T (.A(m)u — q) 

P T (Pu - d) + A>t(m) T (,4(m)u - q) 


VP = 


Pm 

Pu 


and 


where 


V 2 P = 


Pm.m Pm., 
Pu.m Pu.t 

\T, 


(14) 

(15) 


Pm.m = A(G(m. u) J G(m. u) + R( m. u. ^l(m)u - q)), (16) 

Pu.u =P T P + \A(m) T A{m), (17) 

Pm.u = A(A'(m, A(m)u - q) + yl(m) 7 'G(m. u)). (18) 

Of course, optimization in the full (m, u)-space is not feasible for large-scale problems, 
so we eliminate u by introducing 

ux(m) = argmin'Ptm. u), (19) 

u 

and define a reduced objective: 

<£a( m) = ■p(m,u A (m)). 

The optimization problem for the state ( fTof has a closed-form solution: 

u A = (yl(m)P4(m) + X~ l P T P)~ l (A(m) r q + A" l Pd). 


( 20 ) 
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The modified system A 1 A + X~ 1 P I P is a low-rank modification of the original PDE 
and incorporates the measurements in the PDE solve. This is the main difference with 
the conventional reduced approach (cf. Algorithm 2j); the estimate of the state is not 
only based on the physics and the current model, but also on the data. 

Following [201 Thm. 1], it is readily verified that the gradient and Hessian of (j)\ 
are given by 

V0A(m) = P m (m, u A ), (21) 

V 2 0a( m) = 'Pm.m(m-UA) 

- Pm,u(m, ua) CPu,u(ni. ua))- 1 Pu,m(m, ua). (22) 

Note that V 2 0 a is the Schur complement of V 2 P. 

A basic Gauss-Newton algorithm for minimizing (p\ is shown in Algorithm [3] Note 
that the computation of the adjoint-state does not require an additional PDE-solve 
in this algorithm. Instead, the forward and adjoint solve are done simultaneously via 
the normal equations. 

Algorithm 3 Basic Gauss-Newton algorithm for find a stationary point of the 
Lagrangian via the penalty method 

Require: initial guess m°, penalty parameter A, tolerance e 
k = 0 

u" = (yl(m°) T /l(m°) + A ~ 1 P T P) 1 (,4(m°) T q + A -1 .Pd) 
v“=A(.4(m»K-q) 
while ||£ m (m*\u 5 ;,v $;)||2 > e do 
g$ = G(m fc ,u^v(; 

//*; = A G T (/ - A ( A T A + A - l P r P)~ l A T ) G 
determine steplength a k 6 (0,1] 
m fc+1 = m k - a k (H k )~ l g k 

u* v +1 = ( A (m t+1 ) r -4 (m fc+1 ) + A-'P T P) _1 (A(m k+1 ) T q + X~ l Pd) 

v* +1 = A(Al(m fc+1 )u* +1 -q) 
k = k + 1 

end while 


Next, we show how the states, and v^, generated by this algorithm relate to 
the states generated by the reduced approach and subsequently that if the algorithm 
successfully terminates the iterates (m*,u^,v^) satisfy 

IIVAm-.ul.vrOlb^f + OfA- 1 ). 

Lemma 4.1 For a fixed m, the states u\ and v\ used in the reduced penalty approach 
(Algorithm are related to the states u rec i and v re d used in the reduced approach 
(Algorithm ^D as follows 

ua = u rcd + 0(A _1 ), 

VA = V red + 0(A _1 ). 

Proof The state variables used in the penalty approach are given by 
u A = ( A r A + A ~ 1 P r P ) _1 (yl r q + A-‘P 2 'd), 


(23) 

(24) 
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v A = X(Au x - q). 

The former can be re-written as 

U A = .4- 1 (I + X - 1 A~ t P t PA- 1 )- 1 (q + X- 1 A~ T P T d). 

For A > ||P.4 -1 H 2 we may expand the inverse as (I + A _1 P) _1 % I - X~ l B + A ~ 2 B 2 + 

..and find that 

u A = ^l _1 q 

+ A- 1 /l- 1 /l- r P T (d - PA-'q) 

- X- 2 A~ l (PA-'^PA- 1 ) A~ T P r (d - PA-'q) + 0( A- 3 ) 

= u rod + A" 1 A" 1 (/ - A" 1 (PA~ 1 ) r (PA~ 1 )^ v red + C(A- 3 ).(25) 
We immediately find 

VA = V red - A- 1 (PA-'flPA- 1 ) v red + Q{ A- 2 ). (26) 

I 

Remark Lemma |4.1| suggests a natural scaling for the penalty parameter; A > 
\\PA~ l \\l can be considered large, while A < || 1 1 |2 can be considered small. 

Theorem 4.2 At each iteration of algorithm^ the iterates satisfy ||£ u (m fe , u^, vj) 11 2 = 
0 and 11 £ v (, 11 $, v§) || 2 = 0( A ~ 1 ). Moreover, if algorithm [$| terminates successfully 
at m* for which ||£ m (m*,uJ, vj)|| 2 < 6, we have ||V£(m*, u^, vj)|| 2 < e 4- 0(A —1 ). 

Proof Using the definitions of \i\ and va we find for any m 
£ u (m. u A , v A ) = A(m) T v x + P{Pu x - d) 

= XA t (Au x - q) + P(Pu x - d) = 0. (27) 

Using the approximations for ua and va for A > 11 /^yl —1 11§ presented in Lemma |4. 1[ 
we find 

£ v (m. ua, va) = j4(m)u A - q 

= A-^.ed + 0(A" 2 ). (28) 

Thus we find 

||/I v (m*,u^vI)|| 2 <0(A- 1 ). (29) 

At a point m* for which ||£ m (m*, u^, vj)|| 2 < e we immediately find that 
||V£(m’,UA, Va)||! = 

||£ m (m*, uj, vj)||| + ||£ u (m*, uj, Va)||| + ||£ v (m*, u)(, Va)|| 2 
< € 2 + 0(A” 2 ), (30) 

and hence that 


I 


||V£(m’,UA,VA)|| 2 < e + 0(A _1 ). 


(31) 
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This means we can use algorithm [3 to find a stationary point of the Lagrangian 
within finite accuracy (in terms of ||V£ I 2 ) with a finite A. I 11 order to reach a given 
tolerance, we need A _1 ||£ v ||2 to be small compared to ||£ m || 2 - This condition can 
easily be verified for a given m and thus serve as the basis for a selection criterion for 
A. 

In an ill-posed inverse problem, we can only hope to reconstruct a few components 
of m*, roughly corresponding the dominant eigenvectors of the reduced Hessian. Thus, 
the tolerance e can be quite large, even for an acceptable reconstruction. Driving 
down the norm of the gradient any further would only refine the reconstruction in the 
eigenmodes corresponding to increasingly small eigenvalues and would hardly affect 
the final reconstruction and datafit. 

Next, we derive an expression for the distance of the final iterate of algorithm |3| 
mj. to a stationary point of the Lagrangian, m* in terms of the chosen tolerance, e, 
and the data-error rj = \ | d — PA ( m * x ) ~ 1 q 11 2 . 

Theorem 4.3 Given a stationary point of the Lagrangian (nT,iT,v*) and the final 
iterate of algorithm [3] 111 \ such that ||£ m (m^. u^, vj)||2 < e, we have ||m^ — m *||2 < 
k(H) (7 + A -1 ?/^ where II = J T J is the reduced. Gauss-Newton Hessian with condition 
number k{H), 7 = e /||//||2 is the scaled tolerance, rj = ||d — (mJ) — 1 q11 2 /1111 2 is 
the scaled data-error and A = A/||PA -1 ||2 is the scaled penalty parameter. 

Proof We expand the gradient of the Lagrangian at the stationary point as 


^m(ni^ 

.“A.vJ) ^ 





Rulin'. 

u v v a) 

* 




£v(m“, 

Ua.v*) ) 





u 

*,v*) K 

(m*,v*) T 

G(m*,u*) T ^ 

1 ( m A - 

- m* 

A'(m*. 

V*) 

P T P 

,4(m* ) T 

U A 

- u* 

G(m*, 

U*) 

A(m') 

«) ) 

1 V vs- 

- V* 


|(32) 


Using the expressions for the gradient of the Lagrangian at (m^,u^,v^) obtained in 
we can obtain an expression for — 111 * etc. by solving 


Theorem 


4.2 


R I< T G t 
I< P T P A T 
G A 0 



£ m (m*,u*,v*) 

0 

A-iv' + C^A- 2 ) 


Eliminating the bottom two rows we find 

H(ml - in') = £ m (mj,uj,vj) + A _1 Fv* + 0( A" 2 ), 

where 

// = /?- K T A~ l G - G t A~ t K + G t A~ t P t PA- 1 g 
is the reduced Hessian and 

F = G t A~ t P t PA~ 1 - K t A~ 1 . 


(33) 


(34) 


Expressing v* as 

v = A~ t P t ( d - Pi4(mJ) -1 q), 

and ignoring the second order terms in II and F , i.e, 
H « G t A~ i P , PA~ i G. 
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F as G t A- t P t PA~ 1 

we find 

IlmX - m“H 2 < IIH^Ha (e + \- 1 \\G T A- T P T \\ a t ] ) , 

where rj = ||d — PA(mJ) - 1 q ||2 denotes the data-error. 

Introducing the scaled tolerance and data-error we express this as 

\\m* x - m *|| 2 < k(H) . 

As noted before, we can only hope to reconstruct a few components of m*, given by 
the dominant eigenvectors of II. Thus, even for an acceptable reconstruction, the 
error \m* x — m *||2 can be relatively large. On the other hand, the data-error for the 
corresponding m x can be quite small, even if \\m x — m *||2 is large. We expect the 
penalty method to yield an acceptable reconstruction when A -1 77 is small compared 
to e. 


5. Algorithm 

In this section, we discuss some practicalities of the implementation of algorithm [3| 
We slightly elaborate the notation to explicitly reveal the multi-experiment structure 
of the problem. In this case, the data are acquired in a series of K independent 
experiments and d = [di:...; d/ v ] is a block-vector. We partition the states and 
sources in a similar manner. Since the experiments are independent, the system 
matrix A is block-diagonal matrix with K blocks A, (m) of size N x N. Similarly, the 
matrix P consists of blocks Pi. Recall that we collect L independent measurements 
for each experiment, so the matrices Pi € M AxL have full rank. 


5.1. Solving the augmented PDE 


Due to the block structure of the problem, the lineal' systems can be solved 
independently. We can obtain the state u, by solving the following inconsistent 
overdetermined system 


( A(m) 

V *~ 1/2 Pi 


q«- ^ 

A-VM, ) ’ 


(35) 


in a least-squares sense. Assuming that all the blocks A. t and Pj are identical, we will 
drop the subscript i for the remainder of this subsection. Next, we will discuss various 
approaches to solving the overdetermined system (35). 

Factorization: If both A and P are sparse, we can efficiently solve the system 
via a QR factorization or via a Cholesky factorization of the corresponding Normal 
equations. I 11 many applications, P T P is a (nearly) diagonal matrix and thus the 
augmented system A 1 A+X~ l P 1 P has a similar sparsity pattern as the original system. 
Thus, the fill-in will not be worse than when factorizing the original system. 

Iterative methods: While we can make use of factorization techniques for small- 
scale applications, industry-scale applications will typically require (preconditioned) 
iterative methods. Obviously, we can apply any preconditioned iterative method 
that is suitable for solving least-squares problems, such as LSQR, LSMR or CGLS 
[211 1221 [22]. Another promising candidate is a generic accelerated row-projected 
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method described by (231 [25] which proved useful for solving PDEs and can be easily 
extended to deal with overdetermined systems |26| . 

To get an idea of how such iterative methods will perform we explore some of the 
properties of the augmented system. The augmented system A T A + \~ l P 1 P is a rank 
L modification of the original system A 1 A. It follows from (271 Thm 8.1.8] that the 
eigenvalues are related as 

l*n{A r A + A ~ 1 P T P) = fi n {A T A) + a n \-\n = 1,2 ,...,N (36) 
where p\{B) > p 2 (B) > ... > p^(B) denote the eigenvalues of B and the coefficients 
a n satisfy a n = L. This means that at worst, 1 eigenvalue is shifted by LA -1 
while at best, all the eigenvalues are shifted by LJV _1 A _1 . For the condition numbers 
k(B) = fii(B)/fi]sj(b) we find 

Cjf l K{A T A) < k(A t A + A ~ l P T P) < Cik(A t A ), (37) 

where C, = (l + w ^ 4) ). 

To illustrate this, we show a few examples for a ID (time-harmonic) parabolic 
PDE — d%) u = 0 and the ID Helmholtz equation (yr -|- d%) u = 0, both with 
Neumann boundary conditions. Both are discretized using first-order finite-differences 
on x £ [0,1] with N = 51 points for oj = l() 7 r. The sampling matrix P consists of L 
rows of the identity matrix (regularly sampled). The ratio of the condition numbers of 
A 1 A and A T A + \~ l P T P for the parabolic and Helmholtz equation are shown in tables 
[2]and[3j For these examples, the condition number of the augmented system is actually 
lower than that of the original system. The eigenvalues are shown in figures [T] and 
[2] These show that the actual eigenvalues distributions do not change significantly. 
We expect that iterative methods will perform similarly on the augmented system as 
they would on the original system. How to effectively precondition the augmented 
system given a good preconditioner for the original system is a different matter which 
is outside the scope of this paper. 

Direct methods: When the matrix has additional structure we might actually 
prefer a direct method over an iterative method. An example is explicit time-stepping, 
where the system matrix A exhibits a lower-triangular block-structure. In this case the 
action of A -1 can be computed efficiently via forward substitution, requiring storage of 
only a few time-slices of the state. The adjoint system A~ 7 can be solved by backward 
substitution, however, the full time-history of the state variable is needed to compute 
the gradient |28| . For the penalty method, the augmented system A T A + A ~ l P 1 P 
will have a banded structure and the system can be solved using a block version of 
the Thomas algorithm, which again would require storage of the full time-history. So 
even in this setting it seems possible to apply the penalty method at roughly the same 
per-iteration complexity as the reduced method. 


5.2. Gradient and Hessian computation 


Given these solutions u/, of (35|, the gradient, g,\ and Gauss-Newton Hessian II\ of 
(px are given by (cf eqs. | 21 | 22 [ ) 

K 

gx = XYl Gl(A k u k - q fc ), (38) 

k =1 

Hx = A£<#(/- Ak {AlA,. + A ~ l P k r P k y l A T k ) G k , (39) 

k=l 
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where Gk = G( m, u/,.). We can compute the inverse of (Aj.Ak 4- \~ l PkP[) in the 
same way as used when solving for the states. In practice, we would solve for one state 
at a time and aggregate the results on the fly. Moreover, the Gauss-Newton Hessian 
admits a natural sparse approximation H\ & G^Gk which has proved to work 

well in practice |29| . 

5.3. Choosing A 

An important aspect of the proposed method is the choice of the penalty parameter, 
A. Theorem |4.2| essentially states that we can expect to find a stationary point of the 
Lagrangian within finite accuracy with finite A as long as A” 1 IJ£ V ||2 is small compared 
to ||£ m 11 2 * We can easily keep track of this quantity - at no significant additional 
computational cost - during the iterations and increase A as needed. Initialization 
can be done by directly enforcing A _1 ||£ v ||2 to ho some fraction of ||£ m ||2 at the 
initial iterate. A natural scaling for A is suggested by Lemma 
where A > 1 is considered large and A < 1 is considered small. 

5.4■ Complexity estimates 

A summary of the leading order computational costs per iteration of the penalty, 
reduced and all-at-once approaches is given in table [ 4 } The storage and computation 
required for the reduced and penalty methods are of the same order in terms of A’, N 
and M. The PDE-solves in both the penalty and reduced approaches can be done 
independently and in parallel. This makes these approaches more attractive for large- 
scale problems from a computational point-of-view. 

We have argued in section [571] that it is plausible that the augmented system can 
be solved as efficiently as the original PDE. However, it is not clear how the penalty 
and reduced methods will compare in the required number of iterations, though we 
expect that for small A the optimization problem is less non-linear and hence easier 
to solve. To reach a given tolerance with the penalty method, however, we need a 
continuation strategy in A, adding to the cost of the penalty method. I 11 the next 
section we compare the actual computational costs on a few test-cases. 


4.1 


A = A|| PA 


-11 


6. Case studies 


The following experiments are done in Matlab, using direct factorization to solve the 
PDEs (with Matlab slash). We consider both a Gauss-Newton (GN) and a Quasi- 
Newton (QN) variant of the algorithms and use a weak Wolfe linesearch to determine 
the steplength. In the GN method the Hessian is inverted using conjugate gradients 
(peg) up to a relative tolerance of 5. The matrix-vector products are computed on 
the fly. For the QN method we use the L-BFGS inverse Hessian with a history size of 
5 [14]. We measure the cost of the inversion by counting the number of PDE solves as 
outlined in table [4] In all experiments, we set A relative to the lar gest eigenvalue of 


A 1 P 1 PA 1 at the initial iterate. This scaling is justified by lemma 4.1 We compare 


1-1 


the results for various fixed values of A and additionally show the results for an ad-hoc 
continuation strategy; performing a few iterations for increasing values of A using the 
result for each A as initial guess for the next. 

To avoid the inverse crime, we compute the data for the ground truth model on 
a finer grid than used for the inversion. 
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In these experiments, we illustrate that the penalty method: 

• converges to a stationary point of the Lagrangian within the predicted tolerance 
ofC^A- 1 ); 

• can give practically the same or better results as the reduced method at a lower 
computational cost; 

• is not overly sensitive to noise; 

• is less sensitive to the initial guess than the conventional approach. 

The Matlab code used to perform the experiments is available from https: //github. 
com/tleeuwen/Penalty-Method. 

6.1. ID DC resistivity 
We consider the PDE 

d t u(t,x) = d x (m(x)d x ) u(t,x), (40) 

on the domain x € [0,1] with Neumann boundary conditions. A finite-difference 
discretization in the temporal Fourier domain gives 

A(m) = zwdiag(w) -f D 1 diag(m )D, (41) 

where uj is the angular frequency, w = [|; 1 ,..., 1; A], m represents the medium 
parameter in the cell-centres and D is the N — 1 x N finite-difference matrix 

^-11 
-1 1 

. . 5 

\ -11/ 
with h = 1/ (N — 1). The Jacobian is given by 
G(m, u) = D 1 diag(Du). 

The ground-truth model is m(x) = 1 + e -10 ( 1-1 / 2 )" and we locate two sources and 
receivers on either end of the domain. The data are generated on a grid with N = 201 
points and we have K = L = 2. 

For the inversion we use N = 101 points. We use a GN method with e = 10” 9 , 
6 = 10 _,{ and include a regularization term j||Dm ||2 with a = 10 . The initial 
parameters are m° = 1. 

The results are shown in figure [3| The convergence plot, figure [ 3 ] (a), shows the 
predicted behaviour of the penalty method; the norm of the gradient of the Laplacian 
stalls at 0(A _l ). The convergence of the continuation strategy shows that it is possible 
to reach the desired tolerance by gradually increasing A (using A = {0.1,1,10,100} 
with a few iterations each). The resulting parameter estimates are very similar as can 
be seen in figure |3](b). The actual costs of the inversion are listed in table [5] The 
computational cost for the various approaches are of the same order of magnitude, 
except for A = 10, where more than twice as many iterations are required. 
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6.2. 2D Acoustic tomography 

Consider the 2D scalar wave-equation 

m(x)dtu(t,x) = V 2 u(t,x), (42) 

oni€HC R 2 with radiation boundary conditions y/m(x)dtu(t,x)—n(x)-Vu(t,x) = 0 
on dtt where n(x) is the outward normal vector. 

Discretization in the temporal Fourier domain leads to a scalar Helmholtz 
equation 

.4(m) = diag(s) — D 1 D, (43) 

where D = [J 2 % D\\ D 2 & I\] with A the (A’,- — 1) x A finite-difference matrix, /, the 
Nj x N{ identity matrix and Si = urrrii in the interior and s* = w 2 m,/ 2 4- iWy/mi/h 
on the boundary. The Jacobian is given by 

G{ m, u) = diag(s')diag(u), (44) 

where s[ = uj 2 in the interior and s\ = (cj 2 + VjJ/ y/nii )/2 on the boundary. 

The observation matrix P samples the solution at the receiver locations using 
2D linear interpolation while the point sources are defined using adjoint 2D linear 
interpolation. 

6.2.1. Ultrasound tomography The domain SI = [0,1] x [0,1] m is discretized using 
N\ x N ‘2 points. The ground-truth m* as well as the source and receiver locations 
are shown in figure 4] We use a single frequency of 5 kHz (i.e., u ; = 10 4 7r). The data 
for the ground-trut i model are generated using N\ = jV 2 = 101 while the following 
experiments are done with N\ = A r 2 = 51. 

Non-linearity: First, we investigate the sensitivity of the misfit functions <p and 
by plotting 0(m* +ajVi -fa 2 v 2 ) and FfliVi +a- 2 V 2 ) as a function of (ai, a 2 )- 

We take vi,v 2 to be slowly oscillatory modes as shown in figure [5] The misfit as a 
function of (ai,a 2 ) is shown in figure[fij We see a radically different behaviour for the 
reduced and penalty methods. 

The first exhibits strong non-linearity and some spurious stationary points while 
for small A the penalty misfit is much better behaved. For larger values A the penalty 
misfit starts to behave more like the reduced misfit as expected. 

Inversion: For the inversion, we include a regularization term ^||Dm ||2 with 
a = 2 and compare the GN method (e = 10 -6 , 6 = 10“ x ) to the QN method 
(e = 10 -6 ). The initial parameter mo is constant at j s 2 /m?. 

The results for the GN method are shown in figure [7] The convergence history, 
figure [ 7 ] (top, left), shows the predicted behaviour of the penalty method; the norm 
of the gradient of the Lagrangian stalls at 0(A _1 ) when using the penalty method. 
The convergence history of the continuation strategy shows that is possible to reach 
the desired tolerance by gradually increasing A (using A = {0.1,1,10,100,1000} with 
a few iterations each). Figure [7] (top, right) shows that the methods perform similarly 
in terms of reconstruction error. The resulting parameter estimates are very similar as 
can be seen in figure [7] (bottom). The actual costs of the inversion are listed in table 
|fi] The penalty method converges to the same error ||m A * — m*|| in less iterations and 
uses less PDE solves. Note that all methods start overfitting after a few iterations. 
This can be countered by including more appropriate regularization or stopping the 
iterations early. The point here is to show that the penalty method gives similar 
results as the reduced method. 
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The results for the QN method are shown in figure [8] The convergence history 
shows the same behaviour as the previous experiment. The costs of the inversion, 
shown in table [7] are slightly less than those of the GN method. As with the GN 
method, the penalty method converges in less iterations and uses less PDE solves per 
iterations. 

Sensitivity to noise: Results for the QN method on data with 10% Gaussian 
noise are shown in figure [fi] Figure [I()| shows the results on data with 20% Gaussian 
noise. These results show that the penalty approach is not overly sensitive to noise and 
gives very similar -even slightly better- results compared to the reduced approach. 

6.2.2. Seismic tomography Here, the domain SI = [0,5] x [0,20] km is discretized using 
N\ x N 2 points. The ground-truth nT as well as the source and receiver locations 
are shown in figure [IT] We use a frequency of 2 Hz (i.e., u = 47r). The data for the 
ground-truth are generated using Ni = 101, N 2 = 401 while the following experiments 
are done with N\ = 51. N 2 = 201. 

Sensitivity to the initial guess For the inversion, we include a regularization 
term ^||Dm ||2 with a = 5 and use the QN method (e = 10 -6 ). We will use two 
different initial guesses, I and II, depicted in figure [TT] and see whether the methods 
converge to the same final iterate. Initial iterate I is much closer to the ground 
truth than the initial iterate II. This can also be observed when looking at the data 
produced by these iterates. The first initial iterate produces data that differs only 
slightly horn the observed data and inversion is considered to be easy. The second 
initial iterate produces data that is shifted significantly with respect to the observed 
data and inversion is considered to be difficult. It should be noted, however, that a 
significant source of the error in the initial model is the region near the surface (z = 0). 
In practical applications, such large errors in this region of the model might not occur. 

The results for initial guess I (figure [Tl] middle) are shown in figure [l2] We see 
that both the reduced and penalty methods converge to roughly the same final iterate 
and are able to fit the data equally well. Starting from initial guess II (figure [TT] 
bottom), however, we see that the reduced and penalty methods converge to different 
final iterates. For small A, however, the penalty method converges to roughly the 
same iterate as when starting from a better initial guess. Looking at the data-fit, we 
observe that the penalty method for small A is still able to fit the data perfectly while 
the reduced method is not. In seismology, this phenomenon is called cycle-skipping. 

Figure [14] shows the convergence of the methods in terms of the data misfit 
||Pu — d ||'2 and the distance to the constraint ||.4(m)u — q|| 2 - We observe that, 
when starting from initial guess I, both the penalty and reduced methods converge 
to approximately the same point. For initial guess II the penalty method for A = 0.1 
and A = 1 needs a few more iterations, but still converges to the same point as for 
initial guess I. For A = 10 and the reduced method, however, the iterations stall at a 
relatively high data misfit. 

These experiments suggests that the penalty method indeed mitigates some of 
the non-linearity of the problem, allowing the optimization to converge to the same 
final iterate, even when the initial guess is further away from the ground truth. 

7. Discussion 

This paper lays out the basics of an efficient implementation of the penalty method 
for PDE-constrained optimization problems arising in inverse problems. While the 


A penalty method for inverse problems 


17 


initial results are promising, some aspects of the proposed method warrant further 
investigation. 

Even though the theoretical results suggest that the penalty approach can find a 
stationary point of the Lagrangian with finite precision with a finite A, it is not clear 
how to choose a suitable value for A a priori. Our analysis and results suggest that 
choosing A to be a small fraction of ||PA -1 ||2 at the initial iterate yields good results. 
A continuation strategy for A is needed if we want to guarantee finding a stationary 
point of the Lagrangian with preset tolerance. A natural way to do this seems to be 
detecting when the penalty method stalls and subsequently reducing A. The numerical 
results suggest that such an approach is viable, but further study is needed in order 
to develop a robust continuation strategy. 

The penalty formulation essentially relaxes the constraints and therefore allows 
for errors in the physics as well as the data. As a result, the penalty formulation 
leads to reduced sensitivity of the final reconstruction to the initial guess. Further 
investigation is needed to characterize this robustness. 

Finally, the Hessian of the penalty objective exhibits additional structure that 
could potentially be exploited. In particular, the penalty-method GN Hessian is full 
rank and allows for a natural sparse approximation H\ % A G T G (cf. equation 181. 
The reduced GN Hessian, on the other hand, has rank of at most ML and does not 
permit such a natural sparse approximation. 


8. Conclusions 

We have presented a penalty method for PDE-constrained optimization with linear 
PDEs with applications to inverse problems. The method is based on a quadratic 
penalty formulation of the constrained problem. This reformulation results in a an 
unconstrained optimization problem in both the parameters and the state variables. 
To avoid having to store and update the state variables as part of the optimization, 
we explicitly eliminate the state variables by solving an over determined linear system. 
The proposed method combines features from both the all-at-once approach, in which 
the states and parameters are updated simultaneously, and the conventional reduced 
approach, in which the PDE-constraints are eliminated explicitly. While having a 
similar computational complexity as the conventional reduced approach, the penalty 
approach explores a larger search space by not satisfying the PDE-constraints exactly. 

We show that we can (theoretically) find a stationary point of the Lagrangian of 
the constrained problem within a given tolerance as long as the penalty parameter, A, 
is chosen large enough. While theoretically we need A t oo, we can suffice with solving 
the problem for a finite A to reach the stationary point within finite precision. 

The main algorithmic difference with the conventional reduced approach is the 
way the states are eliminated from the problem. Instead of solving the PDEs, we 
formulate an overdeter mined system of equations that consists of the discretized PDE 
and the measurements. We discuss the properties of this augmented system and 
show with a few numerical examples that both the structure of the system as well 
as the eigenvalues are not altered dramatically as compared the original PDE. Thus, 
it is plausible that the augmented system can be solved as efficiently using the same 
approach as is used for the original PDE. 

The numerical examples show that very good results can be obtained by using 
even a single, relatively small, value of A. An ad-hoc continuation strategy further 
shows that it is viable to gradually increase A in order to reach the desired tolerance. 
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The numerical examples further show that when using the penalty formulation, 
the optimization problem may actually be less non-linear and that in some cases a 
better parameter reconstruction is obtained as compared to the conventional reduced 
approach. In particular, the results show that the penalty method is not overly 
sensitive to noise and less sensitive to the initial model than the conventional reduced 
approach. 

Thus, the proposed approach is a viable alternative to the conventional reduced 
approach for solving inverse problems with PDE-constraints. 
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small 2D 

large 2D 

industrial 3D 

K 

10 2 

10 a 

To^ 

L 

10 2 

10 3 

10 6 

M 

l() fi 

10 9 

10 12 

N 

l() :i 

10 6 
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Table 1. Typical size of seismic inverse problem in terms K: of the number of 
experiments, L: the number of measurements per experiment, N: the number of 
discretization points and M : the number of parameters. 


A = 0.1 A = 1 A = 10 
L = 1 9.83e-01 9.84e-01 9.89e-01 

L = 10 9.68e-02 5.07e-01 9.'lle-01 

L = 20 9.19e-02 5.01e-01 9.09e-01 


Table 2. Ratio of the condition numbers of A T A + XPJ^Pl and A T A for various 
A and L, where A is a finite-difference discretization of t(107r) — d 2 on x £ [0,1] 
and Pl is a restricted identity matrix of rank L. 


A = 0.1 A = 1 A = 10 
L = 1 1.80e-01 5.44o-01 9.17e-01 

L = 10 1.03e-01 5.06e-01 9.10e-01 

L = 20 9.10e-02 5.00e-01 9.09e-01 


Table 3. Ratio of the condition numbers of A T A + A P]^Pl and A T A for various 
A and L. where A is a finite-difference discretization of (107r) 2 m-f-d 2 on x € [0,1] 
and Pl is a restricted identity matrix of rank L. 



# PDE’s 

Storage 

Gauss-Newton update 

penalty 

I< 

N + M 

solve matrix-free linear system 
in M unknowns, requires K 
(overdetermined) PDE solves per 
mat-vec 

reduced 

2 1< 

2N + M 

solve matrix-free linear system in 
M unknowns, requires 2 K PDE 
solves per mat-vec 

all-at-once 

0 

2 KN + M 

solve sparse symmetric, possibly 
indefinite system in (2KN + 
M) x (2 KN + M) unknowns 


Table 4. Leading order computational and storage costs per iteration of different 
methods; K denotes the number of experiments and N denotes the number of 
gridpoints and M denotes the number of parameters. 


reduced A = 0.1 A = 1 A = 10 increasing A 
iterations 7 5 6 7 6 

PDE solves 496 222 223 280 292 


Table 5. Costs of the ID DC resistivity inversion. 
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reduced A = 0.1 A = 1 A = 10 increasing A 
iterations 6 4 5 6 7 

PDE solves 172 38 56 82 99 


Table 6. Costs of the 2D ultrasound inversion with a GN method. 


reduced A = 0.1 A = 1 A = 10 
iterations 36 18 29 34 

PDE solves 76 21 31 35 


Table 7. Costs of the 2D ultrasound inversion with a QN method. 
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Figure 1. Eigenvalues of the augmented system, A T A -f XP^Pl, for various A 
and L, where A is a finite-difference discretization of i(10tt) — 0% on x € [0,1] and 
Pi, is a restricted identity matrix of rank L. For comparison, the eigenvalues of 
the original system A T A are also shown. 




Figure 2. Eigenvalues of the augmented system, A T A -f XPfP^, for various A 
and L. where /I is a finite-difference discretization of (107T ) 2 m + on x 6 [0, 1] 
and Pl is a restricted identity matrix of rank L. For comparison, the eigenvalues 
of the original system A*A are also shown. 



Figure 3. Convergence history and reconstructions for ID resistivity problem. 
Even though the penalty method does not converge to same tolerance as the 
reduced method in terms of the gradient of the Lagrangian, the resulting 
parameter estimates are almost the same. 
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Figure 4. Ground truth model ( s~/km ~) and locations of the sources (*) and 
receivers (V) 



x,|mj *jM 


Figure 5. Perturbations vi and V 2 used to plot the misfit <£(m* •+• alvi -f 0 . 0 V 0 ) 
and 4>\(m* + aivi + 02 V 2 ) in figure |fij 



reduced A = 0.1 A =1 A = 10 

Figure 6. Misfit in the direction of the perturbations shown in figure[5] For small 
A, the reduced penalty objective (f>\ is less non-linear than the reduced objective 
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Figure 7. Convergence history, reconstruction error and reconstructions for data 
without noise using a GN method. Even though the penalty method does not 
converge to same tolerance as the reduced method in terms of the gradient of the 
Lagrangian, the resulting parameter estimates are almost the same. 



Figure 8. Convergence history, reconstruction error and reconstructions for data 
without noise using a QN method. Even though the penalty method does not 
converge to same tolerance as the reduced method in terms of the gradient of the 
Lagrangian, the resulting parameter estimates are almost the same. 
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Figure 9. Convergence history, reconstruction error and reconstructions for data 
with 10% Gaussian noise using a QN method. Even though the penalty method 
does not converge to same tolerance as the reduced method in terms of the gradient 
of the Lagrangian, the resulting parameter estimates are almost the same. In fact, 
for small A, result is even a little better. 



Figure 10. Convergence history, reconstruction error and reconstructions for 
data with 20% Gaussian noise using a QN method. Even though the penalty 
method does not converge to same tolerance as the reduced method in terms of 
the gradient of the Lagrangian, the resulting parameter estimates are almost the 
same. In fact, for small A, result is even a little better. 
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Figure 11. Ground truth (s 2 /km 2 ) (top) with locations of the sources (*) and 
receivers (v) and initial iterates I (middle, left) and II (bottom, right). The 
bottom row shows the data for a source in the centre for the ground truth (dashed 
line) as well as the data for the two initial iterates. The first initial iterate produces 
data that differs only slightly from the observed data and inversion is considered 
to be easy. The second initial iterate produces data that is shifted significantly 
with respect to the observed data and inversion is considered to be difficult. 
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A = 1 


A = 10 


Figure 12. QN reconstructions after 50 iterations and corresponding data for 
a source in the center, starting from the initial iterate I. Both the penalty and 
reduced methods converge to the same final iterate when starting from this initial 
guess and are able to fit the data equally well. 
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Figure 13. QN reconstructions after 50 iterations and corresponding data for a 
source in the center, starting from the initial iterate II. For small A, the penalty 
method converges to the same final iterate as when starting from initial guess 
II, showing stability against changes in the initial guess. The reduced method 
converges to a completely different model, suggesting that the optimization 
method is stuck in a local minimum. This is confirmed when looking at the 
data-fit. 




IPV-dllj ||P T u k -d|| 2 


Figure 14. Convergence history in terms of the data-fit and distance to the 
constraint, starting from initial iterate I (left) and starting from initial iterate II 
(right). These plots show that, for small A, the penalty method is able to reduce 
both the data and PDE misfit to the same level when starting from either initial 
guess. The reduced method, however, cannot reduce the data misfit to the same 
level when starting from initial guess II, suggesting that it got stuck in a local 
minimum. 
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